Support Vector Machine


A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be employed for both classification and regression purposes. However, SVMs are more commonly used in classification problems and as such, this is what we will focus on in this session.

In the following example, we are trying to separate the data with any of the three lines $H_1$, $H_2$ or $H_3$, and then assign classes based on whether an observation lies above or below the line. Which line do you think is doing the best job?

image.png

As we can see, $H_1$ does not separate the two classes accurately. $H_2$ separates them, but only with a small distance. $H_3$ separates them with the maximum distance. In other words, $H_3$ seems to be the most confident in separating the two classes.

SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes, as shown in the image below.

image.png

What is a hyperplane?

As a simple example, for a classification task with only two features (like the image above), you can think of a hyperplane as a line that linearly separates and classifies a set of data.

Intuitively, the further from the hyperplane our data points lie, the more confident we are that they have been correctly classified. We therefore want our data points to be as far away from the hyperplane as possible, while still being on the correct side of it.

So when new testing data is added, whatever side of the hyperplane it lands will decide the class that we assign to it.

Support Vectors: Support vectors are the data points (vector points) nearest to the hyperplane. They are called support vectors because these points alone “support”, or determine, the decision boundary. Since only these points contribute to the result of the algorithm (the other points do not), they can be considered the critical elements of a dataset.

Margin : The margin is defined as the distance between the separating hyperplane (decision boundary) and the training samples (support vectors) that are closest to this hyperplane.

How do we find the right hyperplane?

Asking this question is similar to asking: how do we best segregate the two classes within the data?

The perpendicular distance between the hyperplane and the nearest data point (support vectors) from either set is known as the margin. The goal is to choose a hyperplane with the greatest possible margin between the hyperplane and any point within the training set, giving a greater chance of new data being classified correctly.

For example, let's look at the figure above again.

Note, finding the largest possible margin allows more accurate classification of new points, making the model a lot more robust. You can see that the new grey point would be assigned correctly to the blue class when using the “H3” hyperplane.

Technically, this hyperplane can also be called the maximum-margin hyperplane.

Mathematically,

A hyperplane is a $(p - 1)$-dimensional subspace of a $p$-dimensional space. The distance between either side of the dashed line and the solid line is the margin.

image.png

For instance, in two dimensions, a hyperplane is a flat one-dimensional subspace—in other words, a line. In three dimensions, a hyperplane is a flat two-dimensional subspace—that is, a plane. In $p > 3$ dimensions, it can be hard to visualize a hyperplane, but the notion of a $(p - 1)$-dimensional flat subspace still applies.

image.png

Thus the dimension of the hyperplane depends on the number of features. For a 2-dimensional space, the hyperplane is 1-dimensional, which is just a line. For a 3-dimensional space, the hyperplane is 2-dimensional: a plane that slices the cube. It becomes difficult to imagine when the number of features exceeds 3.

Any hyperplane can be written mathematically as:

$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 + . . . + \beta_p x_p = 0$$

For a 2-dimensional space, the hyperplane is a line:

$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0$$

The data points above this line are those whose coordinates $x_1$, $x_2$ satisfy:

$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 > 0$$

By similar logic, data points below this line satisfy:

$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 < 0$$

Assuming the label $y$ is either 1 (for green) or -1 (for red), all three lines below are separating hyperplanes, because they all share the same property: above the line is green; below the line is red.

image.png

This property can be written mathematically as follows:

$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 > 0 \quad \text{if} \quad y=1 \quad \text{(Green)}$$

$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 < 0 \quad \text{if} \quad y=-1 \quad \text{(Red)}$$

If we further generalize these two into one, it becomes:

$$y\;(\beta_0 + \beta_1 x_1 + \beta_2 x_2)>0$$
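To make this side-of-the-line test concrete, here is a minimal sketch; the coefficients $\beta_0=1$, $\beta_1=2$, $\beta_2=-1$ and the sample points are made up purely for illustration:

```python
import numpy as np

# Hypothetical separating line: 1 + 2*x1 - 1*x2 = 0
beta = np.array([1.0, 2.0, -1.0])   # [beta_0, beta_1, beta_2]

points = np.array([[1.0, 1.0],      # 1 + 2 - 1 =  2 > 0
                   [0.0, 3.0]])     # 1 + 0 - 3 = -2 < 0

# The sign of beta_0 + beta_1*x1 + beta_2*x2 tells us which side a point is on
scores = beta[0] + points @ beta[1:]
labels = np.sign(scores)
print(labels)  # [ 1. -1.]

# With true labels y, a point is correctly classified iff y * score > 0
y = np.array([1, -1])
print(bool(np.all(y * scores > 0)))  # True
```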

Soft Margin

Sometimes, it may not be possible to separate the two classes perfectly. In such scenarios, a soft-margin is used where some points are allowed to be misclassified or to fall inside the margin.

image.png

Using this example, we can see that the “H4” hyperplane treats the green point inside the margin as an outlier. Hence, the support vectors are the two green points closer to the main group of green points. This allows a larger margin to exist, increasing the model’s robustness.

Applying a soft margin, the SVM tolerates a few points being misclassified and tries to balance the trade-off between finding a line that maximizes the margin and minimizing the misclassification.

How much tolerance (softness) we want to allow when finding the decision boundary is an important hyperparameter for the SVM. In Sklearn, it is represented as the penalty parameter C.

The C parameter lets you decide how much you want to penalize misclassified points. The bigger the C, the more the SVM is penalized when it makes a misclassification.

CAUTION: Beware that, while setting a high value for C is likely to lead to better model performance on the training data, there is a high risk of overfitting the model, producing poor results on the test data.

So, in such a case, we try to find a separating line but tolerate one or a few misclassified points (e.g. the points circled in red).

image.png

You can see the difference between the two models. On the right, the model allows a point to be misclassified, while on the left there is no misclassification.

So on the right, with a low C parameter, we are prioritizing simplicity; this line is also called a line with a soft margin because we are allowing that misclassification. On the left, we are prioritizing making no mistakes, but in this case we are probably overfitting.

image.png

Low C: An SVM model with a low value of C cares almost exclusively about maximizing the margin and cares less about the points close to the boundary.

Medium C: By setting a medium C, we are willing to give up correctly classifying the one orange point in order to have a larger separation between the points that we did succeed in classifying. According to a medium C, this point would actually be predicted as blue.

High C: Using a high C to train our support vector machine, we put the decision boundary between an orange point and its immediate neighbor, the blue point, and perfectly classify all of our examples. In other words, this gives a training accuracy of 100%: essentially everything we used to train the model was predicted correctly. But apart from training accuracy, we would also like to see how the model does on data it has not been exposed to during training, i.e., the test data.

In simple words, parameter C allows us to control the trade-off between a large separation between classified examples and a perfect classification. So if you have a really high C, you are saying you care a lot about getting every single point precisely correct.

But how do you pick the best C parameter? You need to try different values (for example via cross-validation) and see which C value works best for your model.

As suggested in sklearn's SVC documentation:

Setting C: C is 1 by default and it’s a reasonable default choice.

If you have a lot of noisy observations you should decrease it: decreasing C corresponds to more regularization.
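A quick sketch of this behaviour, on a made-up blob dataset; the dataset and the specific C values are arbitrary, chosen only to illustrate the trend:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two noisy, partially overlapping clusters (synthetic data for illustration)
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

results = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Smaller C -> more regularization -> wider margin -> more support vectors
    results[C] = (int(clf.n_support_.sum()), clf.score(X, y))
    print(f"C={C}: support vectors={results[C][0]}, train accuracy={results[C][1]:.2f}")
```

With a small C the margin is wide and many points end up inside it (so they all become support vectors); with a large C the model fits the training data more aggressively.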

Once the classifier is trained, it becomes easy to classify a new data instance. Also, because the SVM needs only the support vectors to classify new data instances, it is quite efficient. In other words, it uses a subset of the training points in the decision function (the support vectors), so it is also memory efficient.

SVM in Linearly Non-separable Case

In the linearly separable case, the SVM tries to find the hyperplane that maximizes the margin, with the condition that both classes are classified correctly. But in reality, datasets are almost never linearly separable, so the condition of 100% correct classification by a hyperplane will rarely be met.

SVMs address non-linearly separable cases by introducing two concepts: the Soft Margin and the Kernel Trick.

In short,

image.png

Kernel SVM

Another reason why SVMs enjoy high popularity among machine learning practitioners is that they can be easily kernelized to solve nonlinear classification problems.

Obviously, we would not be able to separate samples from the positive and negative classes very well using a linear hyperplane as the decision boundary via the linear logistic regression or linear SVM models that we discussed in earlier sections. The solution lies in projecting the samples into a higher-dimensional space.

To understand the concept, let’s take a look at a simple example of data in 1 dimension which is not easy to separate and how adding another dimension makes it easy.

image.png

In this example, the picture on the left shows our original data points. In 1-dimension, this data is not linearly separable, but after applying the transformation $\phi(x) = x^2$ and adding this second dimension to our feature space, the classes become linearly separable.
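The $\phi(x) = x^2$ lift can be checked in a few lines; the data points and the threshold 2.5 below are made up for this sketch:

```python
import numpy as np

# 1-D data: class 1 near the origin, class -1 further out.
# No single threshold on x alone separates the two classes.
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([-1, -1, 1, 1, -1, -1])

# Lift to a second dimension with phi(x) = x**2;
# now the horizontal line z = 2.5 separates the classes.
z = x ** 2
pred = np.where(z < 2.5, 1, -1)
print(bool(np.array_equal(pred, y)))  # True
```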

Explaining this is easy with another simplified example.

image.png

As you can see, red and blue points are not linearly separable since we cannot draw a line that would put these two classes on different sides of such a line. However, we can separate them by drawing a circle with all the blue points inside it and the red points outside it.

How to transform this problem?

In order to classify a dataset like the one above it’s necessary to move away from a 2d view of the data to a 3d view.

Let’s add a third dimension and make it a sum of squared $X_1$ and $X_2$ values i.e., $Z = X_1^2 + X_2^2$ .

Using this three-dimensional space with $X_1$, $X_2$ and $Z$ coordinates, we can now draw a hyperplane (flat 2D surface) to separate red and blue points.

SVM_Polykernel2.gif

The basic idea behind kernel methods to deal with such linearly inseparable data is to create nonlinear combinations of the original features to project them onto a higher dimensional space via a mapping function $\phi(.)$ where it becomes linearly separable. As shown in the next figure, we can transform a two-dimensional dataset onto a new three-dimensional feature space where the classes become separable via the following projection:

$$\phi(x_1,x_2)=(z_1,z_2,z_3)=(x_1,x_2,{x_1}^2+{x_2}^2)$$

This allows us to separate the two classes shown in the plot via a linear hyperplane that becomes a nonlinear decision boundary if we project it back onto the original feature space:

image.png
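The projection above can be verified numerically. In this sketch, two concentric rings (synthetic data) are not linearly separable in 2-D, but become separable after adding the coordinate $z = x_1^2 + x_2^2$:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
r = np.where(np.arange(200) < 100, 1.0, 3.0)   # inner ring vs outer ring
X = np.c_[r * np.cos(theta), r * np.sin(theta)]
y = np.where(r < 2, 1, -1)

# Apply phi(x1, x2) = (x1, x2, x1^2 + x2^2)
Z = np.c_[X, (X ** 2).sum(axis=1)]

acc_2d = SVC(kernel="linear").fit(X, y).score(X, y)  # no line separates the rings
acc_3d = SVC(kernel="linear").fit(Z, y).score(Z, y)  # a plane does, after lifting
print(acc_2d, acc_3d)
```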

What are Kernel Methods?

Kernel methods represent the techniques that are used to deal with linearly inseparable data or non-linear data set shown in above examples. The idea is to create nonlinear combinations of the original features to project them onto a higher-dimensional space via a mapping function, $\phi$ , where the data becomes linearly separable.

By using the above method, the training data can be transformed into a higher-dimensional feature space via a mapping function $\phi$ and a linear SVM model can be trained to classify the data in this new feature space. However, this method is computationally very expensive.

The kernel function is a function that computes the dot product of the mapped samples, and can be represented as follows:

$$k(x_i,x_j) = \phi(x_i)^T\,\phi(x_j)$$

Kernel Trick

The scheme described so far is attractive due to its simplicity. However, considering the computational consequences of increasing the dimensionality, dataset transformations will incur serious computational and memory problems.

Here is a concrete example: the polynomial kernel is a kernel often used with SVMs. For a dataset in $\mathbb{R}^2$, a degree-two polynomial kernel (implicitly) performs the transformation $(x_1, x_2) \rightarrow ({x_1}^2, {x_2}^2, \sqrt{2}\, x_1 x_2, \sqrt{2c}\, x_1, \sqrt{2c}\, x_2, c)$, mapping $\mathbb{R}^2$ into a higher-dimensional feature space. In real applications, there may be many features in the data, and applying transformations that involve many polynomial combinations of these features will lead to extremely high and impractical computational costs. Usually, the computational cost increases as the dimension of the data increases.

The kernel trick owes its name to the use of kernel functions, which enable an algorithm to operate in a high-dimensional, implicit feature space without ever computing the coordinates of the data in that space. Any linear model can be turned into a non-linear model by applying the kernel trick: replacing its features (predictors) by a kernel function.

So, the kernel trick is a simple and computationally cheaper method by which non-linear data is implicitly projected onto a higher-dimensional space, where it can be linearly divided by a plane, without explicit computation of the coordinates in that space.

SVM-KERNEL_TRICK2_edit.png

Our kernel function accepts inputs in the original lower dimensional space and returns the dot product of the transformed vectors in the higher dimensional space. As long as we can compute the inner product in the feature space, we do not require the mapping explicitly. This operation is often computationally cheaper than the explicit computation of the coordinates and thus called Kernel Trick.

In layman's terms, a kernel function is a shortcut that helps us do certain calculations faster which would otherwise involve computations in a higher-dimensional space.

SVM-Kernal-1.jpg

The kernel function reduces the complexity of finding the mapping function: it defines the inner product in the transformed space directly. There are out-of-the-box kernel functions, such as the following, which can be applied when training models using the SVM algorithm:

image.png

In vector form, they are represented as :

image.png

Think of the polynomial kernel as a transformer to generate new features by applying the polynomial combination of all the existing features.

For degree-d polynomials, the polynomial kernel is defined as:

image.png

where $x$ and $y$ are vectors in the input space, i.e. vectors of features computed from training or test samples.

image.png

One of the most widely used kernels is the Radial Basis Function kernel (RBF kernel), or Gaussian kernel, which is also the default kernel used within sklearn’s SVM classification algorithm. This function is as follows:

image.png

The minus sign inverts the distance measure into a similarity score and, due to the exponential term, the resulting similarity score falls into a range between 1 (for identical samples) and 0 (for very dissimilar samples).

Here, $\gamma = {{1} \over {2 \sigma^2}}$ is a free parameter that is to be optimized. The optimization of $\gamma$ also plays an important role in controlling overfitting.
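As a sanity check, the formula above can be compared against sklearn's implementation; the points and the $\gamma$ value here are arbitrary:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

a = np.array([[1.0, 2.0]])
b = np.array([[2.0, 0.0]])
gamma = 0.5

# k(a, b) = exp(-gamma * ||a - b||^2)
manual = np.exp(-gamma * np.sum((a - b) ** 2))
library = rbf_kernel(a, b, gamma=gamma)[0, 0]
print(bool(np.isclose(manual, library)))      # True

# A point compared with itself gives the maximum similarity of 1
print(rbf_kernel(a, a, gamma=gamma)[0, 0])    # 1.0
```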

In the figure given, we can see an example of a decision boundary around classes 0 and 1 using an RBF kernel with a relatively large value of $\gamma$.

image.png

SVM_Complex%20Kernel.gif

Similar to the penalty parameter C in the soft margin, gamma is a hyperparameter that we can tune when we use an SVM with a kernel. Gamma controls how wiggly the SVM decision boundary can be.

If we increase the value of $\gamma$, we shrink the region of influence of each training sample, which leads to a tighter decision boundary; in other words, the boundary becomes more wiggly.

image.png

In Sklearn — svm.SVC(), we can choose ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable as our kernel.

image.png
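A minimal sketch comparing the built-in kernels on a toy dataset; the dataset and its noise level are made up, and with default hyperparameters the RBF kernel is expected to outperform the linear one on this non-linear problem:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-moons: a non-linear classification problem
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

accs = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    accs[kernel] = SVC(kernel=kernel).fit(X, y).score(X, y)
    print(f"{kernel:8s} train accuracy = {accs[kernel]:.2f}")
```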

The kernel trick sounds like a “perfect” plan. However, one critical thing to keep in mind is that when we map data to a higher dimension, there is a chance we may overfit the model. Thus choosing the right kernel function (including the right parameters) is of great importance.

Extra Reading


In SVMs, our optimization objective is to maximize the margin. The margin is defined as the distance between the separating hyperplane (decision boundary) and the training samples that are closest to this hyperplane, which are the so-called support vectors. This is illustrated in the below figure.

image.png

To get a better understanding of margin maximization, we can take a closer look at the positive and negative hyperplanes that are parallel to the decision boundary.

Mathematically,

The hyperplanes can be expressed like this:

$$w_0+w^Tx_{positive}=1$$

$$w_0+w^Tx_{negative}=-1$$

If we subtract the two linear equations from each other, we get:

$$w^T \left( x_{positive}-x_{negative} \right) = 2$$

We can normalize this by the length of the vector w, which is defined as follows:

$$\Vert w \Vert = \sqrt{\sum_{j=1}^m w_j^2}$$

Then, we'll have the following:

$$\frac {w^T \left( x_{positive}-x_{negative} \right)} {\Vert w \Vert} = \frac {2} {\Vert w \Vert}$$

The LHS of the equation can then be interpreted as the distance between the positive and negative hyperplane, which is the margin that we want to maximize.

image.png
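The relation margin $= \frac{2}{\Vert w \Vert}$ can be checked on a small made-up dataset by fitting a (near) hard-margin linear SVM with a large C and inspecting the decision-function values at the support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated toy blobs (hypothetical data)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates a hard margin
w = clf.coef_[0]

margin = 2.0 / np.linalg.norm(w)
print(margin)   # distance between the positive and negative hyperplanes

# Support vectors lie on w_0 + w^T x = +/- 1
sv_scores = clf.decision_function(clf.support_vectors_)
print(bool(np.allclose(np.abs(sv_scores), 1.0, atol=1e-2)))  # True
```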

Now the objective function of the SVM becomes the maximization of this margin by maximizing $\frac {2}{\Vert w \Vert}$, with the constraint that the samples are classified correctly. In other words, all negative samples should fall on one side of the negative hyperplane, while all positive samples should fall on the other side of the positive hyperplane.

So, the constraint should look like this:

$$w_0+w^Tx^{(i)} \begin{cases} \ge 1 & \quad \text{if} \; y^{(i)}=1 \\ \le -1 & \quad \text{if} \; y^{(i)}=-1 \end{cases}$$

We can write it more compact form:

$$y^{(i)} \left ( w_0+w^Tx^{(i)} \right ) \ge 1 \quad \forall i$$

In practice, though, it is easier to minimize the reciprocal term ${1 \over 2}{{\Vert w \Vert}}^2$, which can be solved by quadratic programming, specifically as a constrained optimization problem using Lagrange multipliers. However, a detailed discussion of this is beyond the scope of this section.

Slack variables

The reason for introducing the slack variable $\xi$ is that the linear constraints need to be relaxed for non-linearly separable data: the optimization must be allowed to converge in the presence of misclassifications, under an appropriate cost penalization. This leads to the so-called soft-margin classification.

The slack variable simply can be added to the linear constraints:

$$w_0+w^Tx^{(i)} \begin{cases} \ge 1-\xi^{(i)} & \quad \text{if} \; y^{(i)}=1 \\ \le -1+\xi^{(i)} & \quad \text{if} \; y^{(i)}=-1 \end{cases}$$

So, the new objective to be minimized becomes:

$${1 \over 2}{\Vert w \Vert}^2 + C \left ( \sum_i \xi^{(i)} \right )$$

With the variable C, we can control the penalty for misclassification.

Large values of C correspond to large error penalties while we are less strict about misclassification errors if we choose smaller values for C.

We can then use the parameter C to control the width of the margin and therefore tune the bias-variance trade-off as shown in the picture below:

image.png

Using the kernel trick to find separating hyperplanes in higher dimensional space

To solve a nonlinear problem using an SVM, we transform the training data onto a higher dimensional feature space via a mapping function $\phi(.)$ and train a linear SVM model to classify the data in this new feature space. Then we can use the same mapping function $\phi(.)$ to transform new, unseen data to classify it using the linear SVM model.

The basic idea to deal with linearly inseparable data is to project it onto a higher-dimensional space where it becomes linearly separable. Let us call this nonlinear mapping function $\phi$, so that the mapping of a sample $\mathbf{x}$ can be written as $\mathbf{x} \rightarrow \phi(\mathbf{x})$.

Now, the term “kernel” describes a function that calculates the dot product of the images of two samples under $\phi$:

\begin{equation}\kappa(\mathbf{x_i}, \mathbf{x_j}) = \phi (\mathbf{x_i})^T \phi (\mathbf{x_j}) \end{equation}

In other words, the function $\phi$ maps the original $d$-dimensional features into a larger, $k$-dimensional feature space by creating nonlinear combinations of the original features. For example, suppose $\mathbf{x}$ consists of 2 features:

$$\mathbf{x} = \big[x_1 \quad x_2\big]^T \quad \quad \mathbf{x} \in {\rm I\!R}^d$$

$$\Downarrow \phi$$

$$\mathbf{x}' = \big[x_1 \quad x_2 \quad x_1 x_2 \quad x_{1}^2 \quad x_1 x_{2}^3 \quad \dots \big]^T \quad \quad \mathbf{x}' \in {\rm I\!R}^k \quad (k \gg d)$$

However, one problem with this mapping approach is that the construction of the new features is computationally very expensive, especially if we are dealing with high-dimensional data. This is where the so-called kernel trick comes into play.

We will not go into much detail about the mathematics involved in the kernel trick. In practice, all we need is to replace the dot product ${x^{(i)}}^Tx^{(j)}$ by $\phi(x^{(i)})^T\;\phi(x^{(j)})$. In order to save the expensive step of calculating this dot product between two transformed points explicitly, we define a so-called kernel function:

$$k(x^{(i)},x^{(j)}) = \phi(x^{(i)})^T\;\phi(x^{(j)})$$

As discussed earlier, sometimes it is impossible to find a single line to separate the two classes (red and blue) in the input space. But, after projecting the data into a higher dimension (i.e. the feature space in the figure), we are able to find a hyperplane which classifies the data. The kernel helps to find such a hyperplane in the higher-dimensional space without increasing the computational cost much, whereas usually the computational cost increases when the dimension of the data increases.

How come the kernel doesn’t increase the computational complexity?

We know that the dot product of two vectors of the same dimension gives a single number. The kernel exploits this property to compute the dot product in a different space without ever visiting that space.

Assume that we have two features. This means each data point lives in $\rm I\!R^2$:

$$ \mathbf{x_i} = \begin{bmatrix} x_{i1} \\ x_{i2}\\ \end{bmatrix} \quad $$

The subscript $i$ on $x$ indexes the data points; similarly, the subscripts 1 and 2 denote the features. Also, assume that we are applying a transformation that converts the two-dimensional input space (two features) into the four-dimensional feature space $(x_{i1}^2,\,x_{i1}x_{i2},\,x_{i2}x_{i1},\,x_{i2}^2)$.

To calculate the dot product of two vectors in the four-dimensional (transformed) space, the standard way is:

  1. Convert each data point from $\rm I\!R^2 \rightarrow \rm I\!R^4$ by applying the transformation.

We have taken two data points $\mathbf{x_i}$ and $\mathbf{x_j}$, thus:

$$\phi(\mathbf x_{i}) = \begin{bmatrix} x_{i1}^2 \\ x_{i1}x_{i2}\\ x_{i2}x_{i1}\\ x_{i2}^2 \end{bmatrix} \quad \quad \quad \phi(\mathbf x_{j}) = \begin{bmatrix} x_{j1}^2 \\ x_{j1}x_{j2}\\ x_{j2}x_{j1}\\ x_{j2}^2 \end{bmatrix} \quad $$
  2. Take the dot product of the two vectors.
$$\phi(x_{i}) \cdot \phi(x_{j})$$

As stated before, the kernel function calculates the dot product in the transformed space without ever visiting it. The kernel function for the above transformation is:

$$K(\mathbf{x_i, x_j}) = (\mathbf x_{i}^T \; \mathbf x_{j} )^2$$

Example:

Let's say, $ \mathbf{x_i} = \begin{bmatrix} 1 \\ 2\\ \end{bmatrix} \quad \mathbf{x_j} = \begin{bmatrix} 3 \\ 5\\ \end{bmatrix} \quad $

The dot product in the four-dimensional space, computed the standard way, is

$ = \begin{bmatrix} 1\\ 2\\ 2\\ 4 \end{bmatrix} . \begin{bmatrix} 9 \\ 15\\ 15\\ 25\\ \end{bmatrix} \quad = 9 + 30 + 30 + 100 = 169 $

The above dot product can be calculated using the above kernel function without even transforming the original space.

$$K(\mathbf{x_i, x_j}) = \left( \begin{bmatrix} 1 \\ 2\\ \end{bmatrix} \cdot \begin{bmatrix} 3 \\ 5 \\ \end{bmatrix} \right)^2 = (3 + 10)^2 = 169 $$

Concretely, whenever you see the pattern $\mathbf x_{i}^T \; \mathbf x_{j}$ (the dot product of $x_{i}$ transpose and $x_{j}$, where $x$ is an observation), $\mathbf x_{i}^T \; \mathbf x_{j}$ can be replaced by $K(\mathbf{x_i}, \mathbf{x_j})$.
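The worked example above can be reproduced in code, confirming that the explicit 4-D mapping and the kernel shortcut give the same answer:

```python
import numpy as np

xi = np.array([1.0, 2.0])
xj = np.array([3.0, 5.0])

# Explicit route: map each point to (x1^2, x1*x2, x2*x1, x2^2), then take the dot product
def phi(x):
    return np.array([x[0] ** 2, x[0] * x[1], x[1] * x[0], x[1] ** 2])

explicit = phi(xi) @ phi(xj)   # computed in R^4

# Kernel trick: square the 2-D dot product, never leaving the input space
kernel = (xi @ xj) ** 2        # computed in R^2

print(explicit, kernel)  # 169.0 169.0
```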

A detailed but simplified explanation of kernel functions can be found here.